Sparse and silent coding in neural circuits 



Andras Lorincz 1 , Zsolt Palotai 2,3 and Gabor Szirtes 1 

1 Eotvos Lorand University, Pazmany Peter setany 1/C, Budapest, Hungary, H-1117 

2 Sparsense Inc., USA 
3 ELTE-Soft kft., Pazmany Peter setany 1/C, Budapest, Hungary, H-1117 

October 19, 2010 
Abstract 

Sparse coding algorithms seek a linear basis in which signals can be represented
by a small number of active (non-zero) coefficients. Such coding has many applications in
science and engineering and is believed to play an important role in neural information processing.
However, due to the computational complexity of the task, only approximate solutions provide
the required efficiency (in terms of time). As new results show, under particular conditions there
exist efficient solutions by minimizing the magnitude of the coefficients ('$\ell_1$-norm') instead of
minimizing the size of the active subset of features ('$\ell_0$-norm'). Straightforward neural
implementation of these solutions is not likely, as they require a priori knowledge of the number of
active features. Furthermore, these methods utilize iterative re-evaluation of the reconstruction
error, which in turn implies that final sparse forms (featuring 'population sparseness') can only
be reached through the formation of a series of non-sparse representations, which is in contrast
with the overall sparse functioning of the neural systems ('lifetime sparseness'). In this article
we present a novel algorithm which integrates our previous '$\ell_0$-norm' model on spike based
probabilistic optimization for sparse coding with ideas coming from novel '$\ell_1$-norm' solutions.

The resulting algorithm allows neurally plausible implementation and does not require an
exactly defined sparseness level; thus it is suitable for representing natural stimuli with a varying
number of features. We also demonstrate that the combined method significantly extends the
domain where optimal solutions can be found by '$\ell_1$-norm' based algorithms.

Keywords: $\ell_1$-norm, cross-entropy method, sparse coding.

1 Introduction 



To cope with uncertainty in the observations, neural systems are supposed to implement a generative
model [32] for explaining the signals with a set of hypotheses about the world. The models can
then be used to predict future events or spatio-temporal patterns of observations. Of the many
possibilities, the so-called sparse coding models seem to get the strongest support from theory
(e.g. [36]) as well as from experiments (e.g. [52, 59], but see, e.g., [2]). Regarding neural activity,
sparse coding refers to a special form of representation where a small number of neurons are active
at a time and the rest are either silent or show weak activity [13]. Population codes of this sparse
form are suggested to approximate signals by a linear superposition of the smallest possible subset
of features (also called basis functions) out of a given 'dictionary' of features [38].

Formally, the goal of sparse coding is to represent a signal as a sparse linear combination of
basis elements:

$$\min_{\mathbf{a}} \|\mathbf{a}\|_0 \quad \text{subject to} \quad \mathbf{x} = \mathbf{D}\mathbf{a} \tag{1.1}$$

where $\mathbf{x} \in \mathbb{R}^N$ is the signal to be reconstructed, coefficient vector $\mathbf{a} \in \mathbb{R}^M$ is the subject of
sparsification and $\mathbf{D} \in \mathbb{R}^{N \times M}$ denotes the so-called generative matrix, also referred to as top-down
or reconstruction matrix or dictionary. Columns of matrix $\mathbf{D}$ are unit norm basis vectors. Sparsity
is measured by the $\ell_0$ pseudo-norm (denoted by $\|\cdot\|_0$), which counts the number of nonzero elements
in a vector: $\|\mathbf{a}\|_0 = \#\{j \ \text{s.t.}\ a_j \neq 0\}$.
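
As a small illustration of the two sparsity measures that recur throughout the paper, the following sketch counts non-zero coefficients and sums their magnitudes. The function names `l0_pseudo_norm` and `l1_norm` and the example vector are illustrative assumptions, not taken from the original.

```python
import numpy as np

def l0_pseudo_norm(a):
    """Number of non-zero entries of a (the l0 pseudo-norm)."""
    return int(np.count_nonzero(a))

def l1_norm(a):
    """Sum of the absolute values of the entries of a (the l1 norm)."""
    return float(np.sum(np.abs(a)))

# Hypothetical coefficient vector with three active components.
a = np.array([0.0, 1.5, 0.0, -0.5, 0.0, 2.0])

n_active = l0_pseudo_norm(a)            # counts the active components
scaled_active = l0_pseudo_norm(10 * a)  # rescaling leaves the count unchanged
magnitude = l1_norm(a)                  # 1.5 + 0.5 + 2.0
```

The count is insensitive to the size of the coefficients, which is why it is called a pseudo-norm, whereas the sum of magnitudes scales with them.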

It has been argued that sparseness is required to decrease the metabolic cost of spiking,
regardless of the statistics of the input signals [IB] . For most natural stimuli, the underlying (hidden)
statistics of the signals are sparse (see, e.g. [37, 52]), so it is plausible to seek appropriate sparse
representations for them. One of the most important features of sparse coding is that it can be
used to learn overcomplete dictionaries, where the number of features is larger than the dimension
of the stimuli (that is, $M > N$). Neural populations are usually highly overcomplete, which may
offer many advantages like improved storage capacity or additional decrease of metabolic cost [39].

Unfortunately, all these favorable properties come at a cost: finding the optimally sparse
representation of a signal using an overcomplete dictionary is a so-called 'NP-hard' combinatorial problem [33].
Informally, there is no guarantee that any approach would provide a solution to all problems in this
form in polynomial time. A problem is said to be solvable in polynomial time if the number of
steps required to complete the algorithm for a given input is $O(d^n)$ for some nonnegative integer $n$,
where $d$ is the dimension of the input. Polynomial algorithms are considered 'fast' in theory, but
in practice there are significant differences among the solutions. For sparse coding, the complexity
of the problem is related to the required sparsity level: $k = K/M$, where $M$ is the size of the
representation, and $K$ is the number of active components. Because exact solutions are difficult
to obtain, different approximate solutions have been proposed, like in [15, 37]. Since sparse coding
is very important in many computation intensive tasks (e.g. bioinformatics, computer vision, data
mining), a lot of effort has been put into the development of new methods which can solve at least
some problems efficiently. Recent studies [H[13] which can be traced back to [38] have reported
a surprising discovery: under certain conditions sparse signals can be exactly reconstructed via
solving the related problem of minimization in $\ell_1$-norm (for a comprehensive list of the conditions
under which different solutions may succeed, see [56]). In other words, the same results can be
obtained by minimizing the sum of the absolute values of the coefficients instead of minimizing the
number of non-zero coefficients (minimization in $\ell_0$-norm). These methods have proven robust in
the sense that even noisy or heavily undersampled signals can be recovered this way. This
observation relaxes the combinatorially hard problem, yet the time complexity of the best solutions
still scales with the third power of the dimension of the input. Since cubic scaling is still prohibitive
for real life applications, several approximate solutions for $\ell_1$-norm based sparse coding have been
proposed, which are not guaranteed to provide the optimal solution, but are at least fast,
approaching linear complexity.

In addition to the problem of computational complexity, there are other concerns for which
most methods could not be directly implemented in a parallel, neurally plausible manner. As
stated in [45], most sparse coding methods share the following properties: (1) they include
neurally implausible (centralized, non-local) operations, (2) they fail to produce exactly zero-valued
coefficients in finite time (sub-optimal solutions), (3) for time-varying stimuli they produce
non-smooth variations in the coefficients, and (4) they use only a heuristic approximation to minimizing
a desired objective function. In our view, however, there are two additional concerns that render
most sparse coding approaches neurally implausible. First, efficient methods require a common
predefined sparseness level for all inputs. Second, mapping between the input and the sparse
representation requires a tremendous amount of synaptic transfers, and for overcomplete representations
the sheer amount of bottom-up (that is, input $\rightarrow$ sparse representation) calculations may become
prohibitive from a metabolic point of view (dendritic propagation, excitatory postsynaptic
potentials and transmitter release) [21]. A neurally plausible sparse coding system should also minimize
the required synaptic transfers when seeking a solution. This requirement essentially implies that
during the process of sparsification the transient activities should also be sparse, thus maintaining
lifetime sparseness (i.e. sparse activity of the individual neurons over time) [61].

In this study we address these concerns by integrating two approaches for sparse coding. The
core of the proposed solution is based on our earlier method [25] which utilizes probabilistic
combinatorial optimization [36] and explicit sparsification by minimizing in $\ell_0$-norm. The other constituent
is a variation of fast, but non-optimal $\ell_1$-norm based methods whose contribution is an acceleration
in feature subset selection. The resulting method not only addresses the issues articulated by [45],
but it also minimizes synaptic transfer between layers maintaining different representations of the
signal and does not require a strict predefined value for the sparseness level. Our contribution is
that we introduce a novel algorithm, which (1) exploits the favorable properties of $\ell_1$-norm based
solutions, (2) can be implemented in a neurally plausible (online, parallel and local) fashion and
(3) is efficient in terms of computational time and metabolic cost. The focus is now on
forming sparse representations, so we do not discuss methods to learn the feature set (but see, for
example, [20l[2l[38l[6Q] on this matter). 

In order to fully expose our ideas, we briefly review in Section 2 the main classes of $\ell_1$-norm
based sparse coding methods. We then shortly review a probabilistic optimization scheme for
combinatorial problems called the Cross-Entropy (CE) method [46], which can also be applied in sparse
coding problems by explicitly minimizing the number of non-zero coefficients [25]. This section ends
with the presentation of the integrated solution. In Section 3 we demonstrate the superiority of our
method over established methods on artificial, but large-scale problems when the exact solution
is known. In Section 4 we discuss the relevance of our proposal to distributed computations. We
also discuss correspondences between the proposed algorithm and the underlying computations in
sensory processing. Conclusions are drawn in Section 5. In the Appendix we detail the pseudo-code
of the $\ell_1$-norm based and the CE methods as the main constituents of our integrated solution.

2 Methods 

In this section we present our notations and describe the so-called $\ell_1$-norm based solutions to the
problem of sparse coding as defined by Eq. (1.1). In addition, the cross-entropy method is presented,
which is provably convergent, has been very efficient in probabilistic combinatorial optimization
and can be used for explicit sparsification by minimizing in $\ell_0$-norm. Finally we present an
integrated solution, which is believed to inherit many favorable properties of the $\ell_0$ and $\ell_1$-norm based
methods.

2.1 Notation




$\ldots, a_M)^T \in \mathbb{R}^M$, where $T$ stands for transposition, be defined as



We use calligraphic letters for index series: $\mathcal{J} := \{(j_1, \ldots, j_J) : j_1, \ldots, j_J \in \{1, \ldots, M\},\ J \le M\}$,
where the corresponding plain capital letter denotes the size of the index series and indices are
between 1 and $M$.

For any $\mathbb{Z}^+ \ni K\ (\le M)$ and index series $\mathcal{K} = (k_1, \ldots, k_K)$ (where $k_n \in \{1, \ldots, M\}$ and $k_n \neq k_m$
if $n \neq m$) let $\mathbf{a}[\mathcal{K}]$ denote the $\mathcal{K}$-indexed components of $\mathbf{a} \in \mathbb{R}^M$, i.e., $\mathbf{a}[\mathcal{K}] = (a_{k_1}, \ldots, a_{k_K})^T$.
Similarly, for matrix $\mathbf{D} \in \mathbb{R}^{N \times M}$ and for series $\mathcal{K} = (k_1, \ldots, k_K)$ ($k_n \neq k_m \in \{1, \ldots, M\}$), let
$\mathbf{D}[\mathcal{K}] = (\mathbf{d}_{k_1}, \ldots, \mathbf{d}_{k_K})$.

For a vector $\mathbf{a} \in \mathbb{R}^M$ let us denote the indices of the components arranged in decreasing
order by $\mathrm{MaxInd}_{\mathcal{K}}(\mathbf{a})$, i.e., $\mathrm{MaxInd}_{\mathcal{K}}(\mathbf{a}) = (k_1, \ldots, k_K)$ with $a_{k_n} \ge a_{k_m}$ for all $n < m$ and
$n, m \in \{1, \ldots, K\}$. We will refer to this set as the 'ordered index series'. Then for matrix $\mathbf{D} \in \mathbb{R}^{N \times M}$,
matrix $\mathbf{D}[\mathrm{MaxInd}_{\mathcal{K}}(\mathbf{a})]$ is in $\mathbb{R}^{N \times K}$ and the columns of this matrix are ordered according to the
values of the first $K$ largest components of vector $\mathbf{a}$. We will call $\mathbf{D}[\mathrm{MaxInd}_{\mathcal{K}}(\mathbf{a})]$ the '$K$-truncated
matrix'.

The pseudo-inverse of matrix $\mathbf{D}$ is defined as $\mathbf{D}^\dagger = (\mathbf{D}^T \mathbf{D})^{-1} \mathbf{D}^T$. A single superscript for matrices
and vectors denotes the iteration number: $\mathbf{a}^i$ means vector $\mathbf{a}$ at iteration $i$. For notational simplicity,
the iteration number will be dropped and updates will be denoted by an arrow pointing to the left ($\leftarrow$),
e.g., for update $\mathbf{a}^{i+1} = \mathbf{a}^i + \mathbf{c}^i$ we use $\mathbf{a} \leftarrow \mathbf{a} + \mathbf{c}$ if no confusion may arise.
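
The index notation can be made concrete with a short NumPy sketch. It is only an illustration under stated assumptions: the helper name `max_ind`, the random dictionary and the example vector are hypothetical, and indices are 0-based here whereas the text counts from 1. The closed-form pseudo-inverse is applied to a truncated matrix with linearly independent columns, which is the case in which $(\mathbf{D}^T\mathbf{D})^{-1}$ exists.

```python
import numpy as np

def max_ind(a, K):
    """Indices of the K largest components of a, ordered by decreasing value."""
    order = np.argsort(-np.asarray(a, dtype=float), kind="stable")
    return order[:K]

rng = np.random.default_rng(0)
D = rng.standard_normal((6, 10))
D /= np.linalg.norm(D, axis=0)           # unit-norm columns, as assumed in the text

a = np.array([0.1, 0.9, 0.0, 0.4, 0.0, 0.7, 0.0, 0.2, 0.0, 0.3])

idx = max_ind(a, K=3)                    # ordered index series (0-based)
D_K = D[:, idx]                          # K-truncated matrix, shape (N, K)

# Closed-form pseudo-inverse (D^T D)^{-1} D^T of the truncated matrix.
pinv_formula = np.linalg.inv(D_K.T @ D_K) @ D_K.T
```

For a full-column-rank matrix this closed form coincides with the Moore-Penrose pseudo-inverse returned by `np.linalg.pinv`.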

2.2 Solution via linear programming

Heuristics (e.g. as in [37]) developed for sparse coding problems are now being replaced by
methods based on the $\ell_1$-norm substitution [45, 49]. In doing so, the difficult combinatorial optimization
problem can be recast into a much easier convex optimization task, for which there exist efficient
interior point-type linear programming [3] techniques. This version of the optimization reads as

$$\min_{\mathbf{a}} \|\mathbf{a}\|_1 \quad \text{subject to} \quad \mathbf{x} = \mathbf{D}\mathbf{a} \tag{2.1}$$

or in a relaxed form for the noisy case:

$$\min_{\mathbf{a}} \left( \|\mathbf{x} - \mathbf{D}\mathbf{a}\|_2^2 + \lambda \|\mathbf{a}\|_1 \right) \tag{2.2}$$

where $\lambda$ is a trade-off parameter controlling the balance between the reconstruction quality and
sparsity. For more details, see, e.g. [8].

2.3 Iterative methods

The downside of simple linear programming based solutions is that their remarkable polynomial
computational complexity is cubic at best, which is still prohibitive for large scale applications.
The need for faster decoders - aiming at linear time operation - has brought about improved linear
programming methods (e.g. linear programming with preconditioned conjugated gradients [18])
as well as other classes not based on convex programming. Most importantly, a family of greedy
algorithms (e.g. [35, 58]) - following the ideas of the Matching Pursuit (MP) approach [30] - became
dominant due to their low complexity and simple geometric interpretation.

The core idea is that reconstruction is built up iteratively: the best element of the dictionary
is selected to minimize the residual at iteration $l$: $\mathbf{r}^l = \mathbf{x} - \mathbf{D}[\mathcal{I}]\mathbf{a}^l$, that is, the difference between
the signal and its actual estimation made of the combination of the previously chosen elements of
the dictionary. $\mathcal{I}$ denotes the corresponding index series of the already selected basis functions,
$\mathbf{D}[\mathcal{I}]$ denotes the matrix made of the active basis functions and thus forming a restricted top-down
reconstruction matrix, whereas $\mathbf{a}^l$ represents the corresponding (optimized) multipliers. Although
MP is fast, its approximation performance can be quite poor. Improved solutions, like Orthogonal
Matching Pursuit (OMP) methods, provide better performance, but usually require longer running
times. In general the iteration of the different greedy methods consists of the following two steps:

1. Select: Find the basis vector that provides the largest projection of the residual and whose
   index is not yet contained in $\mathcal{I}$:

   $$b = \arg\max_m |\mathbf{d}_m^T \mathbf{r}^l| \tag{2.3}$$

   Expand the index set with this new index: $\mathcal{I} \leftarrow \mathcal{I} \cup b$.

2. Improve basis set and update: Expand the sparse basis: $\mathbf{D}[\mathcal{I}] \leftarrow \mathbf{D}[\mathcal{I}] \cup \mathbf{d}_b$. Compute the
   new approximation, i.e., the coefficients $\mathbf{a}^{l+1}$ and the residual $\mathbf{r}^{l+1}$ by minimizing the
   approximation error $\|\mathbf{r}^{l+1}\|_2$ according to the applied method.

Iteration may stop when the allowed number of dictionary elements is reached or when the
approximation error is reduced below a predefined threshold. Note that MP and OMP methods differ in
their update procedure [28]:

MP: $\mathbf{r} \leftarrow \mathbf{r} - \mathbf{d}_b \mathbf{d}_b^T \mathbf{r}$,

OMP: $\mathbf{r} \leftarrow \mathbf{r} - \mathbf{D}[\mathcal{I}]\mathbf{D}[\mathcal{I}]^\dagger \mathbf{r} \ (= \mathbf{x} - \mathbf{D}[\mathcal{I}]\mathbf{D}[\mathcal{I}]^\dagger \mathbf{x})$.

In MP not only the choice of the feature subset is suboptimal, but also the approximation of the
coefficients. OMP improves upon MP by recomputing all coefficients at every step in order to ensure
orthogonality between the residual and the columns of $\mathbf{D}[\mathcal{I}]$. While OMP methods achieve better
reconstruction, they impose stronger constraints on the matrices than the original theorem for
$\ell_1$-norm based optimization does (see, e.g. [10]). Further acceleration can be achieved if more than one
basis function is chosen at every iteration, as e.g. in the 'stagewise' OMP [12]. Nevertheless, all
of these iterative procedures are somewhat limited by the following issue. Once a decision is made,
the chosen features remain in the subset until the algorithm terminates. One mistake may thus
influence the reconstruction quality for many subsequent iteration steps.
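
The select and update steps, and the difference between the MP and OMP residual updates, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the paper: the function name `greedy_pursuit`, the stopping tolerance `tol` and the random test problem are hypothetical, and the least-squares solve stands in for multiplication with the pseudo-inverse.

```python
import numpy as np

def greedy_pursuit(D, x, K, orthogonal=True, tol=1e-10):
    """Greedy sparse approximation of x using at most K columns of D.

    orthogonal=False: MP update,  r <- r - d_b d_b^T r
    orthogonal=True:  OMP update, r <- x - D[I] D[I]^+ x
    """
    x = np.asarray(x, dtype=float)
    a = np.zeros(D.shape[1])
    r = x.copy()
    support = []                          # index series I of the selected columns
    for _ in range(K):
        proj = D.T @ r                    # projections of the residual on all columns
        score = np.abs(proj)
        if support:
            score[support] = -1.0         # skip indices already contained in I
        b = int(np.argmax(score))         # select step
        support.append(b)
        if orthogonal:
            # OMP: recompute all coefficients on the selected columns.
            coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
            a[:] = 0.0
            a[support] = coef
            r = x - D[:, support] @ coef
        else:
            # MP: only the new coefficient is set; columns have unit norm.
            a[b] = proj[b]
            r = r - D[:, b] * proj[b]
        if np.linalg.norm(r) < tol:
            break
    return a, r

# Hypothetical test problem: 3 active components out of 50.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
a_true = np.zeros(50)
a_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
x = D @ a_true

a_mp, r_mp = greedy_pursuit(D, x, K=3, orthogonal=False)
a_omp, r_omp = greedy_pursuit(D, x, K=3, orthogonal=True)
```

In the OMP branch the residual is orthogonal to every selected column after each step, whereas the MP branch only guarantees orthogonality to the most recently selected column.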

2.3.1 Subspace pursuit (SP) methods 

A remedy for the above-mentioned problem has been independently proposed in [10] and [34]. These
methods assume that at most $K$ components are sufficient to represent the input. The methods
enlarge the subset of candidate features by $K$ [10] (or $2K$ [34]) features and then decrease their
number back to $K$ at every iteration. These methods thus work with subspaces. The method of [10]
is as follows. First, at iteration $i$ matrix $\mathbf{D}^T$ and residual $\mathbf{r}^{i-1}$ are used to increase the feature set:
features corresponding to the $K$ largest values of $\mathbf{D}^T \mathbf{r}^{i-1}$ are retained and added to the previously
selected feature set, $\mathbf{D}^{i-1}$. Second, all the selected $2K$ features are used for reconstruction using
the pseudoinverse and the set is then reduced again by retaining only the $K$ largest features. Last,
the new residual is computed with the retained features. For completeness, the corresponding
pseudocode of SP of [10] is provided in the Appendix (Table 3).
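
The expand, reconstruct and reduce cycle described above might be sketched as follows. This is a hedged illustration, not the pseudocode of the Appendix: the function name `subspace_pursuit`, the initialization from the $K$ largest projections of the signal, the use of magnitudes when ranking, and the parameters `max_iter` and `tol` are assumptions made for the example.

```python
import numpy as np

def _least_squares(D, idx, x):
    """Coefficients of the least-squares fit of x on the columns D[:, idx]."""
    coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)
    return coef

def subspace_pursuit(D, x, K, max_iter=50, tol=1e-10):
    x = np.asarray(x, dtype=float)
    # Assumed initialization: the K columns with the largest projections of x.
    support = np.argsort(-np.abs(D.T @ x))[:K]
    coef = _least_squares(D, support, x)
    r = x - D[:, support] @ coef
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        # Expand: add the K columns with the largest projections of the residual.
        candidates = np.argsort(-np.abs(D.T @ r))[:K]
        merged = np.union1d(support, candidates)          # at most 2K indices
        # Reconstruct on the merged set, then keep the K largest coefficients.
        coef_merged = _least_squares(D, merged, x)
        keep = merged[np.argsort(-np.abs(coef_merged))[:K]]
        coef_keep = _least_squares(D, keep, x)
        r_new = x - D[:, keep] @ coef_keep
        # Halt if the reconstruction error is not improved.
        if np.linalg.norm(r_new) >= np.linalg.norm(r):
            break
        support, coef, r = keep, coef_keep, r_new
    a = np.zeros(D.shape[1])
    a[support] = coef
    return a, np.sort(support)
```

Unlike the purely greedy schemes, a feature chosen at one iteration can be dropped at a later one, because the support is re-selected from the merged set every time.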

2.4 Subspace cross-entropy (SCE) method for combinatorial optimization 

While the different $\ell_1$-norm based optimization methods are getting more efficient regarding their
computational time complexity, their neural implementation seems quite limited or even impossible
due to some native shortcomings of the algorithms. On the one hand, linear programming based
methods employ centralized and non-local transformations during the optimization process
(see [45]).

On the other hand, iterative methods, like MP, OMP, and SP rely on matrix-vector multipli- 
cations, which are -in theory- suitable for local rule based implementations. However, in all of 
these methods the transposed form of the reconstruction matrix D is used in all iteration steps. 
While reconstruction of the original signal uses only K x N channels (top-down connections or 
active synapses in neural networks) and thus only K x N multiplications are required, selecting or 
updating the coefficients requires all N x M bottom-up channels to transmit signals between the 
layers. Considering the significant metabolic cost of synaptic activation, the huge M/K ratio may 
suggest that ecological computations should prefer K x N top-down reconstructions over N x M 
bottom-up transformations. 
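
The asymmetry between the two directions can be checked with simple arithmetic. The sketch below is illustrative only: the function name `synaptic_multiplications` is hypothetical, and the sizes are the ones used in the simulations reported later (input dimension 64, representation size 1024, 8 active components).

```python
def synaptic_multiplications(N, M, K):
    """Multiplications per top-down reconstruction and per bottom-up transformation."""
    top_down = K * N      # only the K active columns of D are used
    bottom_up = N * M     # every entry of D^T takes part
    return top_down, bottom_up, bottom_up / top_down

top_down, bottom_up, ratio = synaptic_multiplications(N=64, M=1024, K=8)
# top_down = 512, bottom_up = 65536, ratio = 128.0, which equals M / K
```

With these sizes a single bottom-up transformation costs as much as 128 top-down reconstructions.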

There is another issue, too, which is beyond simple linear modeling of the signals. As
emphasized in the Introduction, neural systems must be endowed with predictive capacity, for
which probabilistic modeling is needed. In addition, due to the complex interplay between different
noise sources (e.g. exogenous source noise generated by the world and endogenous channel noise
generated by the system itself), neural systems often rely on different 'modalities' (parallel signal
channels) and on their internal generative model to 'fill in' missing parts of the signals or correct
presumably corrupted ones. In turn, purely deterministic representational models do not provide
enough information to infer, e.g., the reliability of a given feature.

We thus pursue an alternative approach which (1) minimizes the number of costly bottom-up
signal transmissions and (2) provides probability estimations, yet can efficiently solve the problem
of sparse coding. Our solution combines the subspace pursuit idea with the cross-entropy (CE)
method for combinatorial optimization, which directly tries to minimize the number of active
components ($\ell_0$-norm based optimization).

2.4.1 Cross-entropy (CE) method 

The CE method is a global optimization technique [47] aiming to find the solution in the following
form:

$$\mathbf{y}^* := \arg\min_{\mathbf{y}} f(\mathbf{y}),$$

where $f$ is a general objective function. The CE method resembles the estimation-of-distribution
evolutionary methods (see e.g. [31]). As a global optimization method, it provably converges to
the optimal solution [31, 37]. This method has been successfully applied in different problems
like optimal buffer allocation, pQ, DNA sequence alignment [17], reinforcement learning [53] or
independent process analysis [53].

While most optimization algorithms maintain a single candidate solution $\mathbf{y}(t)$ at each time step,
the CE method maintains a distribution over possible solutions. From this distribution, solution
candidates are drawn at random. By continuous modification of the sampling distribution, random
guessing becomes a very efficient optimization method.

One may start by drawing many samples from a fixed distribution g and then selecting the best 
sample as an estimation of the optimum. The efficiency of this random guessing depends greatly 
on the distribution g from which the samples are drawn. After drawing a moderate number of 
samples from distribution g, we may not be able to give an acceptable approximation of y*, but we 
may still obtain a better sampling distribution. The basic idea of the CE method is that it selects 
the best few samples, and modifies g so that it becomes more similar to the empirical distribution 
of the selected samples. 
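
The sample, select and update loop can be sketched for Bernoulli distributions over binary vectors. This is a toy illustration under stated assumptions, not the pseudocode of the Appendix: the function name `cross_entropy_bernoulli`, the Hamming-distance objective and all parameter values are hypothetical, and the update simply moves each probability toward the frequency of 1s among the elite samples.

```python
import random

def cross_entropy_bernoulli(objective, dim, n_samples=200, elite_ratio=0.1,
                            alpha=0.7, n_iter=30, seed=0):
    """Minimize objective over binary vectors of length dim with the CE method."""
    rng = random.Random(seed)
    p = [0.5] * dim                                   # initial Bernoulli parameters
    n_elite = max(1, int(elite_ratio * n_samples))
    best, best_val = None, float("inf")
    for _ in range(n_iter):
        # Draw candidate solutions from the current distribution.
        samples = [[1 if rng.random() < pj else 0 for pj in p]
                   for _ in range(n_samples)]
        # Keep the best few samples (the elite).
        samples.sort(key=objective)
        elite = samples[:n_elite]
        elite_val = objective(elite[0])
        if elite_val < best_val:
            best, best_val = elite[0], elite_val
        # Move p toward the empirical distribution of the elite samples.
        freq = [sum(s[j] for s in elite) / n_elite for j in range(dim)]
        p = [(1.0 - alpha) * pj + alpha * fj for pj, fj in zip(p, freq)]
    return best, best_val, p

# Toy objective: Hamming distance to a hidden sparse binary target.
target = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

def hamming_to_target(y):
    return sum(yi != ti for yi, ti in zip(y, target))

best, best_val, p = cross_entropy_bernoulli(hamming_to_target, dim=len(target))
```

The update factor `alpha` plays the role of a smoothing constant: values below 1 keep the probabilities away from exactly 0 or 1, so no component is excluded permanently after a single iteration.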

For many parameterized distribution families, the parameters of the minimum cross-entropy 
member can be computed easily from simple statistics of the elite samples. For completeness we 
provide the formulae and the corresponding pseudocode for Bernoulli distributions in the Appendix, 
as these will be needed in the sparse feature subset optimization. Derivations as well as a list of 
other discrete and continuous distributions with simple update rules can be found in 

2.5 The SCE method 

Among the iterative methods, SP seems to provide superior speed, scaling and reconstruction
accuracy over other methods by directly refining the subset of reconstructing (active) components at
each iteration. Its native shortcomings, though, are the heavy use of the costly, bottom-up
transformation of the residuals at each iteration and the preset sparsity level $k$. On the other hand,
CE, which is less affected by the costly bottom-up transformations, updates the probability of all
active components similarly, regardless of their individual contributions to the actual reconstruction
error. In turn, we propose an improvement over both methods which inherits the flexibility and
synaptic efficiency of CE and the superior speed and scaling properties of SP without their
shortcomings. The improved model can be realized by inserting an intermediate control step in CE to
individually update the component probabilities based on their contribution to the reconstruction
error. Hence the explicit refinement of SP is replaced by an implicit modification through the
component probabilities. The pseudocode of the combined method is given in Table 1.

As the resulting algorithm is not a greedy method, we call it the Subspace Cross
Entropy (SCE) method without the term 'Pursuit'. In comparison with the algorithms of the
original online CE method (Fig. 1a in [25]) and the SP method there are two major differences.
With regard to the online CE method, SCE takes advantage of the SP idea and updates the
Bernoulli distribution with the help of the reconstruction error: the update of the probability of
the basis functions is proportional to their contribution to the representation of the reconstruction
error. With regard to the SP method, SCE is flexible: it is not restricted to $K$ elements. In
order to keep the $\ell_0$-norm of the representation around $K$, probability vector $\mathbf{p}_t$ is normalized to
$K$. In every iteration SCE draws random samples from the probability vector. It might turn out
that elite samples have more than $K$ 1s, so SCE is not restricted by this parameter. Furthermore,
one may adjust this parameter since the trade-off parameter $\lambda$ of cost function (2.2) controls the
balance between sparseness and reconstruction quality. Such adaptive tuning is not included in the
pseudocode (Table 1) for the sake of simplicity.

In this setting $K$ becomes a soft, tunable parameter dependent on the actual input and
expectations of the system. In the pseudocode we suggest using an exponential function for normalization
based on the projection of the residual on the components. The exponential function may be
replaced by neural non-linear contrast enhancement algorithms (for an early reference, see [6])
like soft winner-take-all (WTA) methods often used for competitive learning in neural networks in
tasks like distributed decision making, pattern recognition or modeling attention [7, 44]. WTA
networks use mutual inhibition to select the largest (or for k-WTA the first k largest) elements of
the input set. The sharp nonlinearity of the k-WTA procedure corresponds to the SP expansion
method (Table 3).

required: $\mathbf{p} = (p_1, \ldots, p_M)$    % initial distribution parameters
initialize: SP and CE
for $t$ from 1 to $t_{SP}$    % SP iteration main loop
    for $\tau$ from 0 to $t_{CE} - 1$,    % CE iteration main loop
        execute CE iteration
    output: $\mathcal{K}$    % CE optimized index set
    $\mathbf{r} \leftarrow \mathbf{x} - \mathbf{D}[\mathcal{K}]\mathbf{D}[\mathcal{K}]^\dagger \mathbf{x}$    % compute next residual
    if $\|\mathbf{r}^t\|_2 > \|\mathbf{r}^{t-1}\|_2$ then quit    % check for improvement
    else: SP-like correction for CE
        $\mathbf{e} \leftarrow \mathbf{D}^T \mathbf{r}$    % BU step of SP
        $\mathcal{M} = (m_1, .., m_k, .., m_M) \leftarrow \mathcal{M}[\mathbf{e}]$    % ordered index set of e
        $\mathbf{p}$: $\exp(-k/K)$    % auxiliary Bernoulli distribution with $\approx K$ number of 1s on average
        $\|\mathbf{r}\|_2\, \mathbf{p}$    % weigh by residual's norm to improve distribution
        $\|\mathbf{p}'\|_1$    % normalize for K to draw K number of 1s on average
end loop

Table 1: Pseudo-code of the subspace cross-entropy (SCE) method for Bernoulli distributions

3 Simulation results 

We performed extensive computer simulations in order to compare the performance of the combined
algorithm to some well established solutions, including direct linear programming methods
('$\ell_1$-Magic' [HE]) and SP of [10]. To see the impact of SP on CE, we also included the results of a
standard batch CE implementation, as we have already shown the equivalence of the online and batch
versions of CE (in terms of global convergence and accuracy) [25]. In order to provide an unbiased
comparison, we tested these methods on a synthetic problem for which the optimal solution is
always known. For each comparison 10 reconstruction matrices were generated. Matrix elements
were drawn randomly from a Gaussian distribution with zero mean and unit variance, then the
columns were normalized to 1. For each matrix 10 distinct random binary vectors were selected as
sparse internal representations (coefficients). Input signals were then generated as products of the
internal representations and the matrices. Results are averaged over 100 runs.
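
The synthetic test problems can be generated along the following lines. This is a sketch under stated assumptions, not the original Matlab code: the function names, the seed and the representation size of 256 are hypothetical, and distinctness of the binary vectors is not enforced explicitly (collisions are extremely unlikely at these sizes).

```python
import numpy as np

def random_dictionary(N, M, rng):
    """Gaussian N x M matrix (zero mean, unit variance) with unit-norm columns."""
    D = rng.standard_normal((N, M))
    return D / np.linalg.norm(D, axis=0, keepdims=True)

def random_binary_code(M, K, rng):
    """Binary vector of length M with exactly K ones at random positions."""
    a = np.zeros(M)
    a[rng.choice(M, size=K, replace=False)] = 1.0
    return a

rng = np.random.default_rng(0)
N, M, K = 64, 256, 8

problems = []
for _ in range(10):                      # 10 reconstruction matrices
    D = random_dictionary(N, M, rng)
    for _ in range(10):                  # 10 sparse codes per matrix
        a = random_binary_code(M, K, rng)
        problems.append((D, a, D @ a))   # the input signal is x = D a
```

Because the generating code `a` is stored with each signal, the optimal solution of every test problem is known and the recovered support can be compared against it directly.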

The dimension of the inputs, the sparse, overcomplete representations, and the number of
nonzero components of the sparse representations (dim(x), dim(a) and $K$, respectively) influence
the computational burden of the test. The task is to reconstruct the signals by estimating the
sparse coefficients. We studied the run time and the reconstruction quality as a function of
overcompleteness and sparseness. All simulations were run on a single core PC. Note that we did not
exploit the parallel nature of the batch CE method, which would add an enormous speed-up in run
time over serial methods. Instead, all algorithms were implemented in Matlab (Mathworks, Natick,
MA) and run in the same environment. Table 2 summarizes the parameters of the algorithms.



Algorithms      Parameters                               Values
--------------  ---------------------------------------  -----------------------
$\ell_1$-Magic  primal dual tolerance                    0.001
                max. number of primal-dual iterations    50
SP              parameter free (except $K$)
CE              reconstruction error when CE stops       $\epsilon = 0.1/K$
                elite ratio                              $\rho = 0.05$
                update factor for $p$                    $\alpha = 0.1$
                number of batch samplings                100
                sample size in one batch                 500
SCE             reconstruction error when CE stops       $\epsilon = 0.1/K$
                elite ratio                              $\rho = 0.05$
                update factor for $p$                    $\alpha = 0.9$
                number of batch samplings                6
                sample size in one batch                 100

Table 2: Parameters of the different algorithms. $K$ is the number of nonzero components.
The stopping criteria were either the maximum iteration number or the reconstruction error.
We used the default setting for '$\ell_1$-Magic' [5]. SP was implemented as given in [10]. Parameters
of batch CE and SCE methods were optimized for these tests by hand.

Fig. 1 shows three graphs characterizing the performance of the $\ell_1$-Magic, SP, CE and
SCE algorithms. Note that CE should find the optimal solution if allowed to run long enough.
This is not the case for $\ell_1$-Magic and SP. Conditions of the theorems of $\ell_1$-Magic are roughly as
follows [1]: a solution can be found provided that the dimension of the input is larger than a constant
(on the order of 4) multiplied by the number of non-zero coefficients and the logarithm of the
size of the representation. Note that the condition may be violated by increasing the size of the
internal representations. SP, for large representations, may get stuck in local minima, since it halts
if the reconstruction error is not improved at a given iteration. Last, but not least, both $\ell_1$-Magic and
SP assume that the number of non-zero components is given beforehand, which is not needed for
CE and SCE.



The first and second subplots (Fig. 1(a) and Fig. 1(b)) show the mean (±std) run time on linear
and logarithmic scales, respectively, as a function of the size of the internal sparse representation.
The logarithmic scale shows the fine structure of the dependencies, but deviations get distorted. The
linear scale demonstrates the order differences among the methods and also shows the symmetric
deviation around the mean. The third plot (Fig. 1(c)) depicts the ratio of good solutions as a
function of the size of the internal sparse representation. The different methods fail to provide good
solutions if they reach the maximum iteration number without converging to a solution or converge
to an incorrect solution. Quite remarkably, all methods show graceful degradation of performance
even for internal dimensions on the order of 1,000. Nonetheless, there are clear advantages for
the CE based methods for very sparse representations: SP and $\ell_1$-Magic methods reach 'critical
sparseness' [10] much earlier. The SCE/CE advantage could be further increased by allowing longer
iterations.

Figure 1: Performance and run time of different sparse coding methods as a function of the
size of the representation. Compared methods: $\ell_1$-Magic, (batch) CE, SP and subspace CE
(SCE). (a)-(b) Average run time for the different methods as a function of overcompleteness,
i.e. the size of the internal representation on linear and logarithmic plots, respectively. Size of
input: dim(x) = 64, number of active components: $K$ = 8. (c) Ratio of good solutions (ratio
of found good components) as a function of overcompleteness. Note: $\ell_1$-norm and SP methods
used the correct number of non-zero components. The strict value of this number is not needed
for CE and SCE methods.

The reconstruction performance of SCE is the same as that of CE within measurement error.
However, SCE runs orders of magnitude faster than CE and $\ell_1$-Magic. The speed of SCE is
due to two factors. One factor is that SCE uses relatively few time-consuming bottom-up
transformations. The other factor is the principled introduction of mutual influence between the
probability updates of the individual components, which incorporates the advantages of the SP method.
In comparison with pure SP, SCE is only an order of magnitude slower. Nevertheless, we
expect to see further acceleration by extending the algorithm with a predictive module utilizing
the CE probabilities not accessible to pure SP.

The impact of complexity on the different methods has been tested by changing the number of
non-zero components in the overcomplete representations. Figure 2 summarizes the results. Here
again the running time (Fig. 2(a)) and the ratio of good solutions (Fig. 2(b)) are plotted, but
now as a function of the number of non-zero components. In this test SP seems superior, but
the combined method still performs well (an increased number of iterations would allow even better
performance at the price of longer run time). Furthermore, as the divergence ratio (size of internal
representation/size of input) increases, SP requires increasingly more iterations for calculating the
reconstruction error, thus resulting in an enormous number of additional synaptic transmissions.
This disadvantage does not show up in the run time measured in Matlab on a single PC, since Matlab is
optimized for matrix-vector computations. Let us remark that the performance of SP may quickly
deteriorate if the number of non-zero components is not known beforehand, which is the typical
case in real scenarios.

Figure 2: Performance and run time of different sparse coding methods as a function of the
number of non-zero elements of the representation (horizontal axis of both panels: number of
nonzero elements). Compared methods: $\ell_1$-Magic, (batch)
CE, SP and subspace CE (SCE). (a) Run time for the different methods as a function of
sparseness, i.e. the number of non-zero elements of the internal representation. Size of input:
dim(x) = 64, size of the overcomplete representation: dim(a) = 1024. (b) Ratio of good
solutions (reconstruction quality) as a function of sparseness. Note: $\ell_1$-norm and SP
methods used the correct number of non-zero components. The strict value of this number is
not needed for CE and SCE methods.

4 Discussion 

Brain-inspired computations are expected to dominate and shape technology in the near future.
While the most compelling features of neural systems are 1) the extremely
low energy cost, 2) massively parallel and distributed organization and 3) outstanding resilience
(robustness against structural and functional perturbations, self-organizing mechanisms for
self-repair, etc.), mainstream digital solutions mostly focus on speed, are organized in a serial
fashion, and can malfunction or fail completely under even small-scale perturbations. To achieve
a paradigm shift, we need a better understanding of the structural and functional
constituents of neural systems that are responsible for the development and maintenance of these
features. In this article we presented a very efficient signal processing approach designed to form
overcomplete, sparse representations. In theoretical neuroscience the usefulness of such
representations has long been recognized (e.g. [37]) for their contribution to increased storage
capacity, better pattern separation and higher noise tolerance.

Recent discoveries in signal processing now suggest another important factor that must be
utilized by evolutionary solutions as well. For most natural signals (i.e. those that are encountered
most frequently by living systems) very efficient sampling is possible by exploiting their intrinsic
sparseness. The fact that signals can be recovered from a small set of samples has a serious impact
on computational speed as well as on noise tolerance, since even partially observed signals can
be recovered. A few novel models [9, 43, 45] have already started to implement these ideas
in neural computations. These studies either deal with learning efficient dictionaries (in terms of
reconstruction quality) or draw parallels between sensory processing and compressed sensing [4, 13].
In this article we argued that efficiency in terms of metabolic cost should also be taken into account
when considering sparse coding mechanisms. Since this constraint is equally important for artificial
and natural systems, we wanted to improve $\ell_1$-norm based solutions by explicitly minimizing the
required number of signal transmissions (e.g. by reducing the number of signal projections and
comparisons, especially when bottom-up transmissions are needed). In contrast to $\ell_1$-norm based
methods, our solution, which is designed to explicitly minimize the number of active components
(i.e., the $\ell_0$ norm), can be efficiently implemented in a parallel, distributed system. Furthermore,
the CE method ensures convergence to the optimal solution [31, 47], which may not be the case for
$\ell_1$-norm based solutions if their conditions are not met. What remains to be answered is whether
the SCE algorithm can be implemented in a neurally plausible form.

4.1 Neural plausibility of the SCE method 

To discuss whether our algorithm makes sense in neural terms, implementation and functional
issues need to be considered. As far as the implementation is concerned, although a detailed translation
of the model into equations used by computational neuroscience is missing, we conjecture that SCE
with online CE [25] can be expressed in neuronal form. An important property of the online variant
of CE is that it preserves the convergence properties of the original batch method [55], although it
relies on threshold adaptation and needs to identify elite samples and update the probabilities at
each iteration. Since thresholds of individual neurons and neuronal ensembles can be independently
and locally adapted, and the discrete time updates of the algorithm can be rephrased in differential
forms similar to those describing the changes of membrane potential, we believe that the whole
procedure can be translated into neuronal form.

As demonstrated above, our solution is indeed a very efficient sparse coding solution with
local rules supporting distributed implementations. However, the SP based update, while considerably
accelerating the computations, does not seem to support lifetime sparseness (at least transiently).
This shortcoming may, for example, be compensated by feedforward inhibitory thresholding [45].

k-WTA is another algorithmic component of our computations. In SCE, its role is to keep the
number of active components low and to update the probabilities of the components. WTA methods
have already been proposed as a computational primitive in the neural system. On a particular use
of k-WTA in modeling hippocampal computations, see [40], whereas for a recent paper on the
computational power of neural WTA methods, see, e.g., [27].
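A generic k-WTA selection step can be sketched as follows. This only illustrates the primitive
itself (keep the k largest entries, silence all others) and is not the SCE implementation; the
function name `k_wta` and the tie-breaking rule are assumptions made for this example.

```python
def k_wta(activations, k):
    """Return a copy of `activations` in which only the k largest entries survive.

    All other entries are set to exactly zero. Ties are broken by index
    (the lower index wins), which is an arbitrary choice for this sketch.
    """
    if not 0 <= k <= len(activations):
        raise ValueError("k must be between 0 and len(activations)")
    # Indices ordered by decreasing activation; sorted() is stable, so equal
    # activations keep their original index order.
    order = sorted(range(len(activations)), key=lambda j: -activations[j])
    winners = set(order[:k])
    return [a if j in winners else 0.0 for j, a in enumerate(activations)]
```

With activations (0.2, 0.9, 0.1, 0.7) and k = 2, only the second and fourth units stay active, so
the output contains exactly zero-valued entries for the losing units.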

The proposed algorithm uses two distinct quantities: the probability of firing (feature selection)
and the activity that reconstructs the input (coefficients of the features). Whether this dichotomy
can be mapped onto assumed forms of neural computations is an open question. There are many
potential solutions, ranging from local neuronal circuits, through considerations on population
codes [26, 41], to models in which neurons themselves represent multi-layer networks exhibiting a range
of linear and non-linear mechanisms [23]. Future work will focus on this issue.

It has long been debated whether the nature of neocortical sensory processing is mostly feedforward or
feedback, since even the first spikes after stimulus onset may carry enough information in, e.g.,
recognition tasks [57]. While most arguments are about the expected speed of information processing,
the capacity of the generative networks or the impact of anatomical and physiological constraints
(see, e.g., [19, 22, 42, 50, 51]), we find that a particular combination of feedforward (bottom-up) and
feedback (top-down) processing provides the most favorable solution for sparse coding in terms of
metabolic cost. Given that the statistics of natural signals support sparse representations, ecological
solutions, like this combined process, may have an evolutionary advantage over purely feedforward
or feedback solutions.

With respect to the criticisms of [45] against previously proposed neural sparse coding mechanisms,
our solution 1) is neurobiologically plausible (using decentralized and local interactions at low
synaptic activity), 2) can generate exactly zero-valued coefficients (since it directly minimizes the
number of active components) and 3) is based on a principled approach as opposed to the heuristic
approximations applied in other proposals. However, the 4th issue, about non-smooth variations in
the coefficients for time-varying stimuli, has not yet been addressed. Its importance stems from the
belief that prediction is probably the most fundamental task of the neural system [23]. Hence any
model claiming to explain some aspects of neural representation should provide means to support
this central task. Currently we are seeking methods to make the proposed algorithm applicable
for predicting time-varying sparse signals. Future work will show if the emerging probabilistic
representations can be used, e.g., in a belief propagation network, thus supporting prediction.

Because all algorithmic steps can be realized by local interactions keeping a low energy profile, 
we argue that the emerging architecture might be a good model of neural sparse coding in sensory 
information processing. Although the generation of candidate representations and the update of 
the component probabilities require recurrent interactions, the constraint to minimize reentrant 
activity (i.e. reducing the intra-layer synaptic activity) results in apparent and approximately 
correct fast feed-forward working. 

Beyond the low metabolic costs of the proposed algorithm, the $\ell_0$-norm based optimization
provides additional flexibility over simple $\ell_1$-norm optimizations by not fixing the number of active
components beforehand. Methods based on the $\ell_1$-norm (like SP) may fail if the optimal sparsity
for a given stimulus differs from the predefined value. We note that for all tests $\ell_1$-norm based
optimization schemes would have provided very poor approximations had they been restricted to
keep a single ratio between the number of non-zero components and the size of the representation
(Fig. 1) or to keep a single number of non-zero components (Fig. 2).

5 Conclusions 

We have proposed a novel solution for sparse coding by combining our previous work [25] on
spike based probabilistic optimization using the $\ell_0$-norm with an efficient $\ell_1$-norm based method. The
motivation behind this integration is that $\ell_1$-norm based methods are computationally attractive
and also provide the optimal solution under certain conditions [4, 13], while our $\ell_0$-norm based
solution using the probabilistic cross-entropy method provides a potential neural interpretation
and does not require the exact knowledge of the number of non-zero components.

The combined method has several outstanding features making it an interesting candidate for 
neural sparse coding in the sensory systems: computations are local, distributed and 'transmission 
efficient' with low metabolic cost. In addition, the algorithm is less affected by constraints of other 
alternatives based on $\ell_1$-norm optimization.

7 Appendix

7.1 Pseudocode for Subspace Pursuit 



input:
    $\kappa = K/M$, $x \in \mathbb{R}^{N}$                      % sparsity and signal
    $t_{SP}$                                                     % max iteration number
    $D \in \mathbb{R}^{N \times M}$                              % M element dictionary
initialization:
    $\mathcal{K} = \mathrm{MaxInd}_K(D^T x)$                     % index set of size K
    $\hat{D} = D[\mathcal{K}]$                                   % K-truncated matrix
    $r_0 \leftarrow x - \hat{D}\hat{D}^{\dagger} x$              % compute residual
optimization:
    for t from 1 to $t_{SP}$                                     % iteration main loop
        compute $\mathcal{K}[D^T r_{t-1}]$                       % index set for expansion
        $\mathcal{L} \leftarrow \mathcal{K} \cup \mathcal{K}[D^T r_{t-1}]$   % increase set size (L = 2K)
        $e \leftarrow D[\mathcal{L}]^{\dagger} x$                % compute projections
        $\mathcal{K} \leftarrow \mathrm{MaxInd}_K(e)$            % new index set of size K
        $\hat{D} \leftarrow D[\mathcal{K}]$                      % K-truncated matrix
        $r_t \leftarrow x - \hat{D}\hat{D}^{\dagger} x$          % compute residual
        if $r_t = 0$ then quit                                   % finish if residual is zero
        if $\|r_t\|_2 > \|r_{t-1}\|_2$ then                      % check for improvement
            $t = t_{SP}$                                         % no new iteration
                                                                 % use previous index set
            quit
    end loop



Table 3: The pseudocode of the Subspace Pursuit method of [10]. The significant difference 
between this and other iterative MP methods is the refinement of the already chosen subset of 
basis by testing the projection of the original signal onto the current basis set. 
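The steps of Table 3 can be sketched in plain Python as follows. This is a minimal illustration
under stated assumptions, not the Matlab code used for the experiments: the dictionary is passed as
a list of M columns, the pseudoinverse projection is replaced by a small normal-equation solver
(which assumes the selected columns are linearly independent and 2K does not exceed N), and all
function names are hypothetical.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _lstsq(cols, x):
    """Least-squares coefficients for x ~ sum_j c[j] * cols[j].

    Solves the normal equations by Gauss-Jordan elimination with partial
    pivoting; assumes the chosen columns are linearly independent.
    """
    n = len(cols)
    # Augmented system [C^T C | C^T x].
    a = [[_dot(ci, cj) for cj in cols] + [_dot(ci, x)] for ci in cols]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def _project(atoms, idx, x):
    """Return (coefficients, residual) of x projected onto the atoms in idx."""
    cols = [atoms[j] for j in idx]
    coef = _lstsq(cols, x)
    approx = [sum(c * col[n] for c, col in zip(coef, cols)) for n in range(len(x))]
    return coef, [xn - an for xn, an in zip(x, approx)]


def _top_k(values, k):
    """Sorted indices of the k entries with the largest absolute value."""
    order = sorted(range(len(values)), key=lambda j: -abs(values[j]))
    return sorted(order[:k])


def subspace_pursuit(atoms, x, k, max_iter=20, tol=1e-12):
    """atoms: list of M dictionary columns, each a list of N floats."""
    support = _top_k([_dot(a, x) for a in atoms], k)
    coef, res = _project(atoms, support, x)
    for _ in range(max_iter):
        if _dot(res, res) <= tol:  # residual is numerically zero
            break
        # Expand the index set with the k atoms best matching the residual.
        expansion = _top_k([_dot(a, res) for a in atoms], k)
        merged = sorted(set(support) | set(expansion))
        # Project x onto the merged set and keep the k largest coefficients.
        merged_coef = _project(atoms, merged, x)[0]
        candidate = sorted(merged[j] for j in _top_k(merged_coef, k))
        new_coef, new_res = _project(atoms, candidate, x)
        if _dot(new_res, new_res) > _dot(res, res):
            break  # no improvement: keep the previous index set
        support, coef, res = candidate, new_coef, new_res
    return support, coef
```

The refinement step is what distinguishes this from a plain matching pursuit: an atom selected
early can be dropped again if, after projecting onto the enlarged set, its coefficient is not among
the K largest.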



7.2 CE method for Bernoulli distribution 

Let the domain of optimization be $\mathcal{Y} = \{0, 1\}^M$, and each component be drawn from
independent Bernoulli distributions, i.e., $\mathcal{G} = \mathrm{BER}^M$. Each distribution $g \in \mathcal{G}$ is parameterized with
an $M$-dimensional vector $p = (p_1, \ldots, p_M)$. Component $j$ of the sample $y \in \mathcal{Y}$ drawn from
distribution $g$ thus will be

$$ y_j = \begin{cases} 1, & \text{with probability } p_j; \\ 0, & \text{with probability } 1 - p_j. \end{cases} $$

Having drawn $I$ samples $y^{(1)}, \ldots, y^{(I)}$, let $L_\gamma$ denote the level set of the elite samples, i.e.,

$$ L_\gamma := \{ y^{(i)} \mid f(y^{(i)}) \le \gamma \} $$

where $\gamma$ denotes a fixed threshold value. The distribution $g'$ with minimum cross-entropy distance
from the uniform distribution over the elite set has the following parameters

$$ p' := (p'_1, \ldots, p'_M), \quad \text{where} \quad
   p'_j := \frac{\sum_{y^{(i)} \in L_\gamma} \chi(y^{(i)}_j = 1)}{\sum_{y^{(i)} \in L_\gamma} 1}
         = \frac{\sum_{y^{(i)} \in L_\gamma} \chi(y^{(i)}_j = 1)}{\rho \cdot I}, $$

where $\chi(\cdot)$ is an indicator function. In other words, the parameters of $g'$ are simply the
componentwise empirical probabilities of 1s in the elite set. Changing the distribution parameters from
$p$ to $p'$ can be too coarse, so in some cases, applying a step-size parameter $\alpha$ is preferable. The
resulting algorithm is summarized in Table 4.
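As a small numerical illustration of this update (the numbers are invented for illustration and are
not taken from the experiments), take $M = 3$, an elite set of four samples, a current parameter
vector $p = (0.5, 0.5, 0.5)$ and a step size $\alpha = 0.6$:

```latex
\begin{aligned}
L_\gamma &= \{(1,0,1),\ (1,1,1),\ (1,0,0),\ (1,0,1)\}
  \;\Rightarrow\;
  p' = \left(\tfrac{4}{4},\ \tfrac{1}{4},\ \tfrac{3}{4}\right) = (1,\ 0.25,\ 0.75), \\
\alpha\, p' + (1-\alpha)\, p
  &= 0.6\,(1,\ 0.25,\ 0.75) + 0.4\,(0.5,\ 0.5,\ 0.5)
   = (0.8,\ 0.35,\ 0.65).
\end{aligned}
```

The first component is 1 in every elite sample, so its probability moves strongly towards 1, while
the step size keeps it from saturating in a single iteration.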



required:
    $p_1 = (p_{1,1}, \ldots, p_{1,M})$                           % initial distribution parameters
    $T$                                                          % max iteration number
    $I$                                                          % population size
    $\rho$                                                       % selection ratio
for t from 1 to T                                                % CE iteration main loop
    for i from 1 to I
        draw $y^{(i)}$ from $\mathrm{BER}^M(p_t)$                % draw I samples
        compute $f_i := f(y^{(i)})$                              % evaluate them
    sort $f_i$-values                                            % order by decreasing magnitude
    $\gamma \leftarrow f_{\rho \cdot I}$                         % level set threshold
    $L_\gamma \leftarrow \{ y^{(i)} \mid f(y^{(i)}) \le \gamma \}$   % get elite samples
    $p'_j \leftarrow \left( \sum_{y^{(i)} \in L_\gamma} \chi(y^{(i)}_j = 1) \right) / (\rho \cdot I)$   % get parameters of nearest distrib.
    $p_{t+1,j} \leftarrow \alpha \cdot p'_j + (1 - \alpha) \cdot p_{t,j}$   % update with step-size $\alpha$
end loop



Table 4: Pseudo-code of the cross-entropy method for Bernoulli distributions 
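The loop of Table 4 can be sketched in plain Python as follows. This is a minimal illustration, not
the authors' implementation. It assumes that $f$ is to be minimized, so samples are ranked with
the smallest $f$ first. It divides by the actual size of the elite set (the first form of the
formula for $p'_j$), which equals $\rho \cdot I$ when there are no ties at the threshold. The
function names, default parameter values and the seeded random generator are assumptions made
for this example.

```python
import random


def ce_update(p, samples, scores, rho, alpha):
    """One CE step: select the elite samples and move p towards their bit frequencies."""
    n_elite = max(1, int(round(rho * len(samples))))
    # Threshold gamma: score of the n_elite-th best (smallest) sample.
    gamma = sorted(scores)[n_elite - 1]
    elite = [y for y, f in zip(samples, scores) if f <= gamma]
    # Componentwise empirical probability of 1s in the elite set.
    p_elite = [sum(y[j] for y in elite) / len(elite) for j in range(len(p))]
    # Smoothed update with step size alpha.
    return [alpha * q + (1.0 - alpha) * pj for q, pj in zip(p_elite, p)]


def ce_bernoulli(f, m, n_iter=30, pop_size=100, rho=0.1, alpha=0.7, seed=0):
    """Minimize f over {0, 1}^m; returns (final p, best sample, best score)."""
    rng = random.Random(seed)
    p = [0.5] * m  # uninformative initial distribution (an assumption)
    best, best_score = None, float("inf")
    for _ in range(n_iter):
        # Draw the population from independent Bernoulli distributions.
        samples = [[1 if rng.random() < pj else 0 for pj in p]
                   for _ in range(pop_size)]
        scores = [f(y) for y in samples]
        # Remember the best sample seen so far.
        for y, s in zip(samples, scores):
            if s < best_score:
                best, best_score = y, s
        p = ce_update(p, samples, scores, rho, alpha)
    return p, best, best_score
```

For instance, with $f$ defined as the Hamming distance to a hidden binary pattern, the probabilities
drift towards that pattern as the iterations proceed. In sparse coding $f$ would instead be a
reconstruction cost of the selected components.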



References 

[1] G. Allon, Dirk P. Kroese, T. Raviv, and Reuven Y. Rubinstein. Application of the cross-entropy method 
to the buffer allocation problem in a simulation-based environment. Annals of Operations Research, 
134:137-151, 2005. 

[2] Pietro Berkes, Ben White, and Jozsef Fiser. No evidence for active sparsification in the visual cortex. In 
Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural 
Information Processing Systems 22, pages 108-116, 2009. 

[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[4] E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from 
highly incomplete frequency information. IEEE Trans. Inf. Theory, 52:489-509, 2006. 

[5] E. Candes and J. Romberg. $\ell_1$-magic: A collection of Matlab routines for solving the convex optimization
programs central to compressive sampling. Online, 2006. http://www.acm.caltech.edu/l1magic/.






[6] G. A. Carpenter and S. Grossberg. ART2: self-organization of stable category recognition codes for 
analog input patterns. Applied Optics, 26:4919-4930, 1987. 

[7] G.A. Carpenter and S. Grossberg. A massively parallel architecture for a self-organizing neural pattern 
recognition machine. Computer Vision, Graphics, and Image Processing, 37:54-115, 1987. 

[8] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on 
Scientific Computing, 43(1):129-159, 2001. 

[9] W. K. Coulter, C. J. Hillar, G. Isely, and F. T. Sommer. Adaptive compressed sensing -
a new class of self-organizing coding models for neuroscience. Accepted at ICASSP2010, 2010.
http://www.msri.org/people/members/chillar/files/ACS-ICASSP10.pdf.

[10] W. Dai and O. Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE 
Transactions on Information Theory, 55(5):2230-2249, 2009. 

[11] Pieter-Tjerk de Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein. A tutorial on the
cross-entropy method. Annals of Operations Research, 134:19-67, 2004.

[12] D. Donoho, Y. Tsaig, I. Drori, and F. Starck. Sparse solution of underdetermined linear equations
by stagewise orthogonal matching pursuit, 2006. Technical Report 2006-02, Stanford, CA: Stanford
University.

[13] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52:1289-1306, 2006. 

[14] D. J. Field. What is the goal of sensory coding. Neural Computation, 6:559-601, 1994. 

[15] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Compu- 
tation, 10:1455-1480, 1998. 

[16] D. J. Graham and D. J. Field. Sparse coding in the neocortex. In Jon H. Kaas and Leah A. Krubitzer, 
editors, Evolution of Nervous Systems, volume III, pages 181-187. Elsevier, 2006. 

[17] Jonathan Keith and Dirk P. Kroese. Sequence alignment by rare event simulation. In Proceedings of 
the 2002 Winter Simulation Conference, pages 320-327, 2002. 

[18] S. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. An interior-point method for large-scale
$\ell_1$-regularized logistic regression. Journal of Machine Learning Research, 2007.

[19] C. Koch and T. Poggio. Predicting the visual world: silence is golden. Nature Neuroscience, 2:9-10, 
1999. 

[20] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In NIPS, pages
801-808, 2007.

[21] P. Lennie. The cost of cortical computation. Current Biology, 13:493-497, 2003. 

[22] H. Liu, Y. Agam, J. R. Madsen, and G. Kreiman. Timing, timing, timing: Fast decoding of object
information from intracranial field potentials in human visual cortex. Neuron, 62:281-290, 2009.

[23] R. Llinas. I of the Vortex: From neurons to self. The MIT Press, 2002. 

[24] M. London and M. Häusser. Dendritic computation. Annu. Rev. Neurosci., 28:503-532, 2005.

[25] A. Lorincz, Zs. Palotai, and G. Szirtes. Spike-based cross-entropy method for reconstruction. Neuro- 
computing, 71:3635-3639, 2008. 

[26] W. J. Ma, J. M. Beck, P. E. Latham, and A. Pouget. Bayesian inference with probabilistic population 
codes. Nature Neuroscience, 9:1432-1438, 2006. 

[27] W. Maass. On the computational power of winner-take-all. Neural Computation, 12(11):2519-2535,
2000.






[28] B. Mailhe, R. Gribonval, F. Bimbot, and P. Vandergheynst. A low complexity orthogonal matching 
pursuit for sparse signal approximation with shift-invariant dictionaries. In ICASSP'09, IEEE Int. Conf. 
on Acoustics, Speech and Signal Processing, April 2009. 

[29] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. 
Journal of Machine Learning Research, 11:19-60, 2010. 

[30] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on 
Signal Processing, 41(12):3397-3415, 1993. 

[31] Heinz Muehlenbein. The equation for response to selection and its use for prediction. Evolutionary 
Computation, 5:303-346, 1998. 

[32] D. Mumford. Neuronal architectures for pattern-theoretic problems. In C. Koch and J. L. Davis, editors, 
Large Scale Neuronal Theories of the Brain, pages 125-152. MIT Press, 1994. 

[33] B. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227- 
234, 1995. 

[34] D. Needell and J. A. Tropp. Cosamp: Iterative signal recovery from incomplete and inaccurate samples. 
Appl. Comp. Harmonic Anal, 26:301-321, 2008. 

[35] D. Needell and R. Vershynin. Uniform uncertainty principle and signal recovery via regularized orthog- 
onal matching pursuit. Foundations of Computational Mathematics, 9:317-334, 2009. 

[36] B. A. Olshausen. Sparse codes and spikes. In R. P. N. Rao, B. A. Olshausen, and M. S. Lewicki, editors, 
Probabilistic Models of the Brain: Perception and Neural Function, pages 257-272. MIT Press, 2002. 

[37] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse 
code for natural images. Nature, 381:607-609, 1996. 

[38] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed
by V1? Vision Research, 37:3311-3325, 1997.

[39] B. A. Olshausen and D. J. Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 
14:481-487, 2004. 

[40] R.C. O'Reilly and J.L. McClelland. Hippocampal conjunctive encoding, storage, and recall: Avoiding 
a tradeoff. Hippocampus, 4:661-682, 1994. 

[41] A. Pouget, P. Dayan, and R. Zemel. Information processing with population codes. Nature Reviews of 
Neuroscience, 1:125-132, 2000. 

[42] R. P. N. Rao and D. H. Ballard. Predictive coding in the visual cortex: A functional interpretation of 
some extra-classical receptive-field effects. Nature Neurosci., 2:79-87, 1999. 

[43] M. Rehn and F. T. Sommer. A network that uses few active neurones to code visual input predicts the 
diverse shapes of cortical receptive fields. Journal of Computational Neuroscience, 22:135-146, 2007. 

[44] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 
2:1019-1025, 1999. 

[45] C. J. Rozell, D. H. Johnson, R. G. Baraniuk, and B. A. Olshausen. Sparse coding via thresholding and
local competition in neural circuits. Neural Computation, 20(10):2526-2563, 2008.

[46] R. Rubinstein and D. P. Kroese. The Cross-Entropy Method. Springer-Verlag, New York, 2004.

[47] R. Y. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Method- 
ology and Computing in Applied Probability, 2:127-190, 1999. 

[48] F. Santosa and W. W. Symes. Linear inversion of band-limited reflection seismograms. SIAM J. Sci. 
and Stat. Comput., 7:1307-1330, 1986. 






[49] M. Seeger. Bayesian inference and optimal design in the sparse linear model. Journal of Machine
Learning Research, 9:759-813, 2008.

[50] T. Serre, A. Oliva, and T. Poggio. A feedforward architecture accounts for rapid categorization.
Proceedings of the National Academy of Science, 104:6424-6429, 2007.

[51] A. M. Sillito, J. Cudeiro, and H. E. Jones. Always returning: feedback and sensory processing in visual
cortex and thalamus. Trends in Neurosciences, 29:307-316, 2006.

[52] E. C. Smith and M. S. Lewicki. Efficient auditory coding. Nature, 439:978-982, 2006.

[53] Z. Szabo, B. Poczos, and A. Lorincz. Cross-entropy optimization for independent process analysis. In
Justinian Rosca, Deniz Erdogmus, Jose C. Principe, and Simon Haykin, editors, Independent Component
Analysis and Blind Signal Separation, volume 3889 of Lecture Notes in Computer Science, pages 909-
916. Springer, 5-8 March 2006.

[54] I. Szita and A. Lorincz. Learning to play using low-complexity rule-based policies: Illustrations through
Ms. Pac-Man. Journal of Artificial Intelligence Research, 30:659-684, 2007.

[55] I. Szita and A. Lorincz. Online variants of the cross-entropy method, 2008.
http://arxiv.org/abs/0801.1988.

[56] T. Tao. Preprints in sparse recovery/compressed sensing, 2008.
http://www.math.ucla.edu/~tao/preprints/sparse.html

[57] S. J. Thorpe. How can the human visual system process a natural scene in under 150ms? Experiments
and neural network models. In Michel Verleysen, editor, European Symposium on Artificial Neural
Networks. D-Facto public, 1997.

[58] J. A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on
Information Theory, 50(10):2231-2242, 2004.

[59] W. Vinje and J. Gallant. Natural stimulation of the nonclassical receptive field increases information
transmission efficiency in V1. Journal of Neuroscience, 22:2904-2915, 2002.

[60] Y. Weiss, H. Chang, and W. Freeman. Learning compressed sensing, 2007. (in press)
http://www.cs.huji.ac.il/~yweiss/allerton-final.pdf

[61] B. Willmore and D. Tolhurst. Characterising the sparseness of neural codes. Network: Comput. Neural
Syst, 12:255-270, 2001.



